Abstract. With Grids, we are able to share computing resources and to provide scientific communities with global, transparent access to local facilities. In such an environment, the problems of fair resource sharing and best usage arise. In this paper, we analyze the usage of the LPC cluster (Clermont-Ferrand, France) in the EGEE Grid environment and, from the results, propose a model for job arrival.



1 Introduction 

Analysis of a cluster workload is essential to better understand user behavior and how resources are used [1]. We are interested in modelling and simulating the usage of a Grid cluster node in order to compare different scheduling policies and to find the one best suited to our needs.

The Grid gives new ways to share resources between sites, both computing and storage resources. The Grid defines a global architecture for distributed scheduling and resource management [2] that enables resource scaling. We would like to better understand such a system so that a model can be defined. With such a model, simulations can be run, and quality of service and fairness could then be offered to the different users and groups.

Briefly, we have several groups of users that each submit jobs to a group of clusters. These jobs are placed in a waiting queue on some cluster before being scheduled and then processed. Each group of users has its own needs and its own job submittal strategy. We wish:

1. to have good metrics that describe the group and user usage of the site.



'This work was supported by EGEE. 



2. to model the global behavior (average job waiting time, average waiting queue length, system utilization, etc.) in order to determine the influence of each parameter and to avoid site saturation.

3. to simulate job arrivals and characteristics in order to test and compare different scheduling strategies. The goal is to maximize system utilization and to provide fairness between site users to avoid job starvation.

As parallel scheduling for p machines is a hard problem [3, 4], heuristics are used [5, 6]. Moreover, we have no exact value for the duration of jobs, which makes the problem more difficult. We need a good model to be able to compare different scheduling strategies. We believe that by characterizing user and group behavior we could design scheduling strategies that promote fairness and maintain a good throughput. In this paper, some metrics are presented, a detailed arrival model for single users and groups is derived from the job submittal protocol, and scheduling problems are discussed. We then suggest a new design based on our observations and show the relationship between the fairness issue and system utilization as a flow problem.

Our cluster usage in the EGEE (Enabling Grids for E-science in Europe) Grid is presented in section 2, where the Grid middleware used is also described. The corresponding scheduling scheme is shown in section 3. Then the workload of the LPC (Laboratoire de Physique Corpusculaire) computing resource is presented (section 4) and the logs are analyzed statistically. A model that describes the job arrival rate to this cluster is then proposed in section 5. Simulation and validation are done in section 6, with a comparison with related works in section 7. Results are discussed in section 8. Section 9 concludes this paper.



2 Environment 

2.1 Local situation 

The EGEE node at LPC Clermont-Ferrand is a Linux cluster made of 140 dual 3.0 GHz CPUs with 1 GB of RAM, managed by 2 servers with the LCG (LHC Computing Grid Project) middleware. We are currently using MAUI as our cluster scheduler [8]. The cluster is shared with the regional Grid INSTRUIRE (http://www.instruire.org). The role of the LPC cluster in EGEE is to be used mostly by Biomedical users* located in Europe and by High Energy Physics communities. Biomedical research is one core application of the EGEE project. The approach is to apply the computing methods and tools developed in high energy physics to biomedical applications. Our team has been involved in an international research group focused on deploying biomedical applications in a Grid environment.

One pilot application is GATE, which is based on the Monte Carlo GEANT4 toolkit developed by the high energy physics community. Radiotherapy and brachytherapy use ionizing radiation to treat cancer. Before each treatment, physicians and physicists plan the treatment using analytical treatment planning systems and medical image data of the tumor. By using the Grid environment provided by the EGEE project, we will be able to reduce the computing time of Monte Carlo simulations, in order to provide a tool with a reasonable running time for specific cancer treatments requiring Monte Carlo accuracy.

Another group is Dteam. This group is partly responsible for sending test and monitoring jobs to our site. The total CPU time used by this group is small relative to the others, but the jobs sent are important for site monitoring. There are also groups using the cluster from the LHC experiments at CERN (http://www.cern.ch). There are different kinds of jobs for a given group. For example, Data Analysis requires a lot of I/O whereas Monte Carlo Simulation needs little I/O.

2.2 EGEE Grid technology 

In the Grid world, resources are controlled by their owners. For instance, different kinds of scheduling policies could be used at each site. A Grid resource center provides the Grid with computing and/or storage resources, and also with services that allow jobs to be submitted by guest users, security services, monitoring tools, storage facilities and software management. The main issue in submitting a job to a remote site is to provide some guarantee of security and correct execution. In fact, the middleware automatically resubmits a job when there is a problem with one site. Security and authentication are also provided as Grid services.

*Our cluster represented 75% of all the Biomed Virtual Organization (VO) jobs in 2004.

The Grid principle is to give users worldwide transparent access to computing and storage resources. In the case of EGEE, this access is made transparent by the LCG middleware, built on top of the Globus Toolkit [10]. Middleware acts as a layer of software that provides homogeneous access to different Grid resource centers.

2.3 LCG Middleware 

LCG is organized into Virtual Organizations 
(VOs): dynamic collections of individuals and in- 
stitutions sharing resources in a flexible, secure and 
coordinated manner. Resource sharing is facili- 
tated and controlled by a set of services that al- 
low resources to be discovered, accessed, allocated, 
monitored and accounted for, regardless of their 
physical location. Since these services provide a 
layer between physical resources and applications, 
they are often referred to as Grid Middleware [11].

Bag-of-tasks applications are parallel applications composed of independent jobs. No communication is required between running jobs. Since jobs from the same task may execute on different sites, communication between jobs is avoided. In this context, users submit their jobs to the Grid one by one through the middleware. Our cluster receives jobs only from the Grid. This means that each job requests one and only one processor. Users can directly specify the execution site or let a Grid service choose the best destination for them. Users give only a rough estimate of the maximum job running time. In general this estimate is too high and very imprecise [12]. Rather than an estimated time, it is better to think of it as an upper bound for the job duration, so the value provided by users is more an indication of precision: the bigger the value, the more imprecise the actual runtime could be.
Figure 1 shows the scenario of a job submittal. In this figure, rounded boxes are Grid services and ellipses are the different job states. As there is no communication between jobs, jobs can run independently on multiple clusters. Instead of communicating during execution, jobs write output files to Storage Elements (SE) of the Grid. Small output files can also be sent to the UI. The Replica Location Service (RLS) is a Grid service that allows replicated data to be located. Other jobs may read and work on the generated data, forming "pipelines" of jobs.

The users' Grid entry point is called a User Interface (UI). This is the gateway to Grid services.
From this machine, users are given the capability 
to submit jobs to a Computing Element and to fol- 
low their jobs status ^Hj- A Computing Element 
(CE) is composed of Grid batch queues. A Com- 
puting Element is built on a homogeneous farm of 
computing nodes called Worker Nodes (WN) and 
on a node called a GateKeeper acting as a security 
front-end to the rest of the Grid. 

Users can query the Information System in order 
to know both the state of different grid nodes and 
where their jobs are able to run depending on job 
requirements. This match-making process has been 
packaged as a Grid service known as the Resource 
Broker (RB). Users could either submit their jobs 
directly to different sites or to a central Resource 
Broker which then dispatches their jobs to match- 
ing sites. 

The services of the Workload Management System (WMS) are responsible for accepting submitted jobs and dispatching them to the appropriate CEs, depending on job requirements and on available resources. The Resource Broker is the machine where the WMS services run; there is at least one RB for each VO. The duty of the RB is to find the best resource matching the requirements of a job (match-making process). (For more details see [13].)

Users are then mapped to a local account on the chosen executing CE. When a CE receives a job, it enqueues it in an appropriate batch queue, chosen depending on the job requirements, for instance on the maximum running time. A scheduler then processes all these queues to decide which jobs to execute. Users can query the status of their jobs during the whole job lifetime.



3 Scheduling scheme 

The goal of the scheduler is first to enable execu- 
tion of jobs, to maximize job throughput and to 
maintain a good equilibrium between users in their 
usage of the cluster ^S]. At the same time sched- 
uler has to avoid starvation, that is jobs, users or 
groups that access scarcely to available cluster re- 
sources compared to others. 

Scheduling is done on-line, i.e. the scheduler has no advance knowledge of the whole set of job requests; jobs are submitted to the cluster at arbitrary times. No preemption is done: the cluster uses a space-sharing mode for jobs. In a Grid environment, long-running jobs are common. The worst case is when the cluster is full of jobs running for days while newly received jobs are blocked in the waiting queue.

Short jobs such as monitoring jobs barely delay longer jobs. For example, a 1-day job can wait 15 minutes before starting, but it is unreasonable for a 5-minute job to wait the same 15 minutes. This has led to classes of algorithms that favor starting short jobs over longer jobs.
(Short jobs have higher priority ^Hj) Some other so- 
lution proposed is to split the cluster in static sub- 
clusters but this is not compatible with a sharing 
vision like Grids. Ideal on-line scheduler will maxi- 
mize cluster usage and fairness between groups and 
users. Of course a good trade-off has to be found 
between the two. 

3.1 Local situation 

We are using two servers to manage our 140 CPUs. On each machine there are 5 queues to which each group can send its jobs. Each queue has its own limit on maximum CPU time. A job in a given queue is killed if it exceeds its queue time limit. There are in fact two limits: one is the maximum CPU time, the other is the maximum total time (or Wall time) a job can use. For each queue there is also a limit on the number of jobs that can run at a given time. This is done to prevent the cluster from filling up with long-running jobs, leaving short jobs unable to run for days. Likewise, there is a limit on the number of running jobs for a given group.

Figure 1: Job submittal scenario

Queue       Max CPU (H:M)   Max Wall (H:M)   Max Jobs
Test        00:05           00:15            130
Short       00:20           01:30            130
Long        08:00           24:00            130
Day         24:00           36:00            130
Infinite    48:00           72:00            130

Table 1: Queue configuration (maximum CPU time, Wall time and running jobs)

Maui Scheduler and the Portable Batch System (PBS) run on multiple hardware platforms and operating systems. MAUI is a scheduling policy engine that is used together with the PBS batch system. PBS manages job reception in queues and execution on cluster nodes. MAUI is a First-Come-First-Served backfill scheduler with priorities. This means that it periodically checks the queues, and execution of lower priority jobs is allowed if it is determined that running them will not delay jobs higher in the queue [8]. Maui is unfortunately not event driven: it regularly polls the PBS queues to decide which jobs to run. MAUI allows a priority to be assigned to each queue. Our site is configured so that the shorter the maximum run time allowed by a queue, the higher the priority given to its jobs. Jobs are then selected to run depending on a priority based on job attributes such as owner, group, queue, waiting time, etc.
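As an illustration of the queue limits in Table 1, the following Python sketch picks the shortest queue whose CPU limit covers a requested CPU time, and orders waiting jobs so that shorter queues come first. The function names and the ranking rule are hypothetical simplifications: they are not taken from the site's MAUI or PBS configuration, and the real priority also depends on owner, group and waiting time.

```python
# Queue limits from Table 1, in minutes: (name, max CPU, max wall).
QUEUES = [
    ("Test", 5, 15),
    ("Short", 20, 90),
    ("Long", 8 * 60, 24 * 60),
    ("Day", 24 * 60, 36 * 60),
    ("Infinite", 48 * 60, 72 * 60),
]


def choose_queue(cpu_minutes):
    """Return the shortest queue whose CPU limit covers the request."""
    for name, max_cpu, _max_wall in QUEUES:
        if cpu_minutes <= max_cpu:
            return name
    raise ValueError("request exceeds the limit of every queue")


def order_waiting_jobs(jobs):
    """Sort (queue, arrival_time) pairs: shorter queue first, then FCFS."""
    rank = {name: i for i, (name, _, _) in enumerate(QUEUES)}
    return sorted(jobs, key=lambda job: (rank[job[0]], job[1]))
```

For instance, `choose_queue(60)` selects the Long queue, since a one-hour request exceeds the 20-minute limit of the Short queue.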



4 Workload data analysis

Workload analysis allows to obtain a model of the 
user behavior ^2]- Such a model is essential for 
understanding how the different parameters change 
the resource center usage. Meta-computing work- 
load ^HI like Grid environments is composed of dif- 
ferent site workloads. We are interested in mod- 
elling workload of our site which is part of the 
EGEE computational Grid. Our site receives only 
jobs coming from the EGEE Grid and each requests 
for only one CPU. 

Traces of user activity are obtained from the accounting records in the server logs. Logs contain information about users, resources used, job arrival times and job completion times. It is possible to use these traces directly to obtain a static simulation, or to use a dynamic model instead. Workload models are more flexible than logs, because they make it possible to generate traces with different parameters
and better understand workload properties pQ. 
Our workload data has been converted to the Standard Workload Format (http://www.cs.huji.ac.il/labs/parallel/workload/) and made publicly available for further research.

The workload covers the period from August 1st 2004 to May 15th 2005. Our cluster has contained 140 CPUs since September 15th. This is visible in figures 2, 3(a) and 3(b), where we notice that the number of jobs sent increases. Statistics are obtained from the PBS log files, which are well structured for data analysis. An AWK script is used to extract information from the PBS log files; AWK acts on lines matched by regular expressions. We do not have information about user login times because users send jobs to our cluster from a User Interface (UI) of the EGEE Grid and not directly.

4.1 Running time

During 280 days, our site received 230474 jobs, of which 94474 were Dteam jobs and 108197 were Biomed jobs (Table 2). Of all these jobs, 23208 failed and were dequeued. It appears that jobs are submitted irregularly and in bursts, that is, many jobs are submitted in a short period of time, followed by a period of relative inactivity. From the logs, there is not much difference between CPU time and total time, which means that jobs sent to our cluster are really CPU-intensive jobs and not I/O-intensive. Dteam jobs are mainly short monitoring jobs, but not all Dteam jobs are regularly sent jobs. Biomed consumed 6784.6 days of CPU time for 108197 jobs (a mean of one and a half hours per job, Table 2). The cumulative distribution of job durations for the Biomed VO is shown in figure 4. About 70% of Biomed jobs last less than 15 minutes and 50% less than 10 seconds. Short-running jobs dominate, but the distribution is very wide, as shown by the high standard deviation compared to the mean in Table 2.

Group     Mean      Standard Deviation   Number of jobs
Biomed    5417      22942.2              108197
Dteam     222       3673.6               94474
LHCb      2072      7783.4               9709
Atlas     13071     28788.8              7979
Dzero     213       393.9                1332

Table 2: Group running time in seconds and total number of jobs submitted

Queue       Mean       Standard Deviation   CV
Test        31.0       373.6                12.0
Short       149.5      1230.5               8.2
Long        2943.2     11881.2              4.0
Day         6634.8     25489.2              3.8
Infinite    10062.2    30824.5              3.0

Table 3: Queue mean running time in seconds, corresponding Standard Deviation and Coefficient of Variation



Figure 2: Number of jobs received per VO and per week from August 2004 to May 2005

Figure 3: Cluster utilization as CPU consumed per VO and per week from August 2004 to May 2005

(a) Dteam job runtime
(b) Biomed job runtime

Figure 4: Dteam and Biomed job runtime distributions (logscale on time axis)

Users submit their jobs with an estimated run length. For the relationship between execution time and requested job duration, and its accuracy, see [19]. To sum up, estimated job durations are essentially inaccurate. The estimate is in fact an upper bound for the job duration, which could in reality take any value below it. Table 3 shows, for each queue, the mean running time, its standard deviation and the coefficient of variation (CV), which is the ratio between the standard deviation and the mean. CV decreases as the queue maximum runtime increases. This means that jobs in shorter queues vary a lot in their duration compared to longer jobs, and we can expect that the higher the upper bound given, the more confidence we can have in using the queue mean runtime as an estimate.
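The coefficient of variation can be recomputed from the means and standard deviations of Table 3. The short Python sketch below does this; the dictionary and function names are illustrative only.

```python
# (mean, standard deviation) of running time in seconds, from Table 3.
TABLE3 = {
    "Test": (31.0, 373.6),
    "Short": (149.5, 1230.5),
    "Long": (2943.2, 11881.2),
    "Day": (6634.8, 25489.2),
    "Infinite": (10062.2, 30824.5),
}


def coefficient_of_variation(mean, std):
    """CV is the ratio between the standard deviation and the mean."""
    return std / mean


cv = {queue: coefficient_of_variation(m, s) for queue, (m, s) in TABLE3.items()}
for queue, value in cv.items():
    print(f"{queue:8s} CV = {value:.2f}")
```

The values decrease from the Test queue to the Infinite queue, which is the trend described above.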

A commonly used method for modelling the duration distribution is to use a log-uniform distribution. Figures 4(a) and 4(b) show the fraction of Dteam and Biomed jobs with duration less than some value. Job duration has been modelled with a multi-stage log-uniform model in [20], which is piecewise linear in log space. In this case, Dteam and Biomed job durations can be approximated with a 3-stage and a 6-stage log-uniform distribution, respectively.
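A multi-stage log-uniform distribution has a cumulative distribution function that is piecewise linear in log(t). The Python sketch below shows one way to evaluate and sample such a distribution. The breakpoints are assumptions for illustration: the 10 second and 15 minute points echo the Biomed percentages quoted in section 4.1, while the end points (1 second, 1 day) are arbitrary, so this is not the fitted 3-stage or 6-stage model.

```python
import bisect
import math
import random


def log_uniform_cdf(t, times, cum_probs):
    """CDF that is piecewise linear in log(t) between the breakpoints."""
    if t <= times[0]:
        return cum_probs[0]
    if t >= times[-1]:
        return cum_probs[-1]
    i = bisect.bisect_right(times, t) - 1
    span = math.log(times[i + 1]) - math.log(times[i])
    frac = (math.log(t) - math.log(times[i])) / span
    return cum_probs[i] + frac * (cum_probs[i + 1] - cum_probs[i])


def log_uniform_quantile(u, times, cum_probs):
    """Inverse CDF: duration below which a fraction u of the jobs falls."""
    i = min(bisect.bisect_right(cum_probs, u) - 1, len(times) - 2)
    frac = (u - cum_probs[i]) / (cum_probs[i + 1] - cum_probs[i])
    span = math.log(times[i + 1]) - math.log(times[i])
    return math.exp(math.log(times[i]) + frac * span)


def sample_duration(times, cum_probs, rng=random):
    """Draw one job duration by inverting the CDF at a uniform random point."""
    return log_uniform_quantile(rng.random(), times, cum_probs)


# Hypothetical 3-stage model: breakpoints in seconds, cumulative probabilities.
TIMES = [1.0, 10.0, 900.0, 86400.0]
CUM = [0.0, 0.5, 0.7, 1.0]
```

Calling `sample_duration(TIMES, CUM)` repeatedly would generate synthetic job durations whose empirical distribution follows the chosen stages.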



4.2 Waiting time 

Table 4 shows that jobs coming from the Dteam group are the most unfairly treated. The Dteam group sends short jobs very often; Dteam jobs are then all placed in the queue, waiting for long jobs from other groups to finish. The Dzero group sends short jobs more rarely and is also less penalized than Dteam, because fewer Dzero jobs wait together in the queue before being processed. The best treated group is LHCb, with jobs that are not very long (about 34 minutes on average) and one job about every 41 minutes. The best behavior to reduce the waiting time per job seems to be to send jobs that are not too short compared to the waiting factor, and not to send them too often, so that they do not all wait together in a queue. Sending very long jobs is not a good behavior either, as the scheduler delays them to run shorter ones if possible.

Group     Mean      Stretch   Standard Deviation   CV
Biomed    781.5     0.874     16398.8              20.9
Dteam     1424.1    0.135     26104.5              18.3
LHCb      217.7     0.905     2000.7               9.1
Atlas     2332.8    0.848     13052.1              5.5
Dzero     90.7      0.701     546.3                6.0

Table 4: Group mean waiting time in seconds, corresponding Standard Deviation and Coefficient of Variation

Figure 5: Job arrival daily cycle (job arrivals per time of day)

Queue       Mean       Standard Deviation   CV      Number of jobs
Test        33335.9    148326.4             4.4     45760
Short       1249.7     27621.8              22.1    81963
Long        535.1      5338.8               9.9     32879
Day         466.8      8170.7               17.5    19275
Infinite    1753.9     24439.8              13.9    49060

Table 5: Queue mean waiting time in seconds, corresponding Standard Deviation, Coefficient of Variation and number of jobs

Table 5 shows the mean waiting time per job on a given queue. There is a problem with such a metric. For example, consider one job arriving on a cluster with only one free CPU: it will run on it for a time $T$ with no waiting time. Consider now that this job is split into $N$ shorter jobs (numbered $0 \ldots N-1$) with the same total duration $T$. Then the waiting time for job number $i$ will be $iT/N$, and the total waiting time $(N-1)T/2$. So the more a job is split, the more it will wait in total. Another metric, which does not depend on the number of jobs, is the total waiting time divided by the number of jobs and by the total job duration. Let $WT$ denote this normalized waiting time. We obtain:

$$WT = \frac{TotalWaitingTime}{NJobs \times TotalDuration} = \frac{MeanWaitingTime}{NJobs \times MeanDuration} \qquad (4.1)$$
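The splitting example and the normalized waiting time of equation (4.1) can be checked numerically. The Python sketch below assumes the same setting as the example (one free CPU, N equal pieces run back to back); the function names and the value of T are illustrative.

```python
def split_waiting_times(total_duration, n_jobs):
    """Waiting time of each of n equal jobs run back to back on one CPU."""
    piece = total_duration / n_jobs
    return [i * piece for i in range(n_jobs)]


def normalized_waiting_time(waits, durations):
    """WT of equation (4.1): total wait / (number of jobs * total duration)."""
    return sum(waits) / (len(waits) * sum(durations))


T = 3600.0  # one hour of work, an arbitrary example value
for n in (1, 2, 10, 100):
    waits = split_waiting_times(T, n)
    durations = [T / n] * n
    print(n, sum(waits), normalized_waiting_time(waits, durations))
```

In this setting the total waiting time grows linearly with N, while WT equals (N-1)/(2N) and therefore stays below 1/2 however finely the job is split.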



Queue       WT         Group     WT
Test        2.35e-2    Biomed    1.58e-5
Short       1.02e-4    Dteam     6.79e-5
Long        5.53e-6    LHCb      1.08e-5
Day         3.65e-6    Atlas     2.23e-5
Infinite    3.55e-6    Dzero     31.9e-5

Table 6: Queue and Group normalized waiting time

With this metric, the Test queue is still the most unfairly treated and the Infinite queue benefits the most compared to the other queues. The Dteam group is again badly treated because its jobs are mainly sent to the Test queue. The most unfairly treated group is Dzero.

4.3 Arrival time

The job arrival daily cycle is presented in figure 5. This figure shows the number of arrivals depending on the job arrival hour, with a 10-minute sampling. Clearly, users prefer to send their jobs on the hour. In fact, we receive regular monitoring jobs from the VO Dteam. The monitoring jobs are submitted every hour from goc.grid-support.ac.uk. Users are located all over Europe, so the effect of sending during working hours is summed over all user time zones. However, the shape is similar to other daily cycles: during the night (before 8am) fewer jobs are submitted, and there are activity peaks around midday, 2pm and 4pm.

Group     Mean (seconds)   Standard Deviation   CV
Biomed    223.6            5194.5               23.22
Dteam     256.2            2385.4               9.31
LHCb      2474.6           39460.5              15.94
Atlas     2824.1           60789.4              21.52
Dzero     5018.7           50996.6              10.16

Table 7: Group interarrival time in seconds, corresponding Standard Deviation and Coefficient of Variation

Table 7 shows the moments of interarrival time for each group. CV is much higher than 1, which means that arrivals are not Poisson processes and are very irregularly distributed. For instance, we could receive 10 jobs in 10 minutes followed by nothing during the next 50 minutes. In this case we have a mean interarrival time of 6 minutes, but in fact when jobs arrived they arrived every minute.
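The burst example can be reproduced with a small synthetic trace. The Python sketch below builds arrivals at one-minute spacing during the first ten minutes of each hour and computes the mean and CV of the interarrival times; the trace is artificial and the function name is illustrative. For comparison, a Poisson process has exponential interarrival times, whose CV is 1.

```python
import statistics


def interarrival_stats(arrival_times):
    """Return (mean, CV) of the gaps between consecutive arrivals."""
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    mean = statistics.mean(gaps)
    cv = statistics.pstdev(gaps) / mean
    return mean, cv


# Arrival times in minutes: 10 jobs, one per minute, at the start of each hour.
arrivals = [60 * hour + minute for hour in range(100) for minute in range(10)]
arrivals.append(60 * 100)  # close the last hour so every hour has 10 gaps

mean_gap, cv_gap = interarrival_stats(arrivals)
print(mean_gap, cv_gap)
```

The mean gap is 6 minutes, as in the text, although nine gaps out of ten last only one minute; the CV of this trace is well above 1.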



Figure 3(a) shows the system utilization of our cluster during each week. At most 980 CPU-days (140 CPUs × 7 days) can be consumed each week. The cluster activity varies widely.
(a) Biomed job arrival rate each 5 minutes (frequencies against jobs sent per 5 minutes)

(b) Dteam job arrival rate each 5 minutes (frequencies against jobs sent per 5 minutes)

Figure 6: Arrival frequencies for Biomed and Dteam VOs (Proportion of occurrences of n jobs received during an interval of 5 minutes)



4.4 Frequency analysis 

Job arrival rate is a common measurement of site usage in queuing theory. Figures 6(a) and 6(b) present the job arrival rate distribution, that is, the number of times n jobs are submitted during an interval of 5 minutes. They show that most of the time the cluster does not receive jobs, and that jobs arrive in groups. Users actually submit groups of jobs and not stand-alone jobs. This explains the shape of the arrival rate: it decreases quickly, but too slowly compared to a Poisson distribution. The Poisson distribution is usually used for modelling the arrival process, but the evidence is against it [21].

Dteam monitoring jobs are short and regular jobs; there is no need for a special arrival model for such jobs. What we observe for other kinds of jobs is that the job arrival law is not a Poisson law (see Table 7, where CV ≫ 1), as is the case, for instance, for web site traffic [22]. What really happens is that users use the cluster from a User Interface during some time interval. During this time they send jobs to the cluster. Users log in to a User Interface machine in order to send their jobs to a RB, which dispatches them to some CEs. Note that one can send jobs to our cluster only from a User Interface; this means, for instance, that jobs running on a cluster cannot send secondary jobs. On a computing site we do not have this user login information, but only job arrivals.

First we look at modelling user arrival and sub- 
mission behavior. Secondly we show that the model 
proposed shows good results for a group behavior. 



5 Model 

5.1 Login model 

In this section we begin to model the user Login/Logout behavior from the Grid job flow (figure 1). We neglect the case where a user is logged in on several UIs at the same time. We say that a user is in the state Login if he is in the process of sending jobs from a UI to our cluster; otherwise he is in the state Logout.

Markov chains are like automata in which each state has transition probabilities. One property of Markov chains is that future states depend only on the current state and not on the past history. So a Markov state must contain all the information needed for future states. We decided to model the Login/Logout behavior as a continuous-time Markov chain. During each interval $dt$, a Logout user has a probability $\lambda\,dt$ to log in and a Login user has a probability $\delta\,dt$ to log out (see figure 7). $\lambda$ is called the Login rate and $\delta$ is called the Logout rate.
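[Diagram: two states, Logout and Login. Logout loops with probability $1 - \lambda\,dt$ and moves to Login with probability $\lambda\,dt$; Login loops with probability $1 - \delta\,dt$ and moves to Logout with probability $\delta\,dt$.]

Figure 7: Login/Logout cycle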



All these parameters could vary over time, as we see with the weekly variation of job arrivals (figure 3(a)) or the variation during the day (figure 5). The proposed model could be used more accurately with non-constant parameters, at the expense of more calculation and more difficult fitting. For example, one could numerically use Fourier series for the Login rate or for the submittal rate to model this daily cycle. From now on we use constant parameters for the calculations, looking for general properties.

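We would like to know the probabilities over time that the user is logged in or not. Let $\mathcal{P}_{Login}(t)$ and $\mathcal{P}_{Logout}(t)$ be respectively the probability that the user is logged in or not logged in at time $t$. From the model we have:

$$\mathcal{P}_{Logout}(t + dt) = (1 - \lambda\,dt)\,\mathcal{P}_{Logout}(t) + \delta\,dt\,\mathcal{P}_{Login}(t) \qquad (5.1)$$
$$\mathcal{P}_{Login}(t + dt) = (1 - \delta\,dt)\,\mathcal{P}_{Login}(t) + \lambda\,dt\,\mathcal{P}_{Logout}(t) \qquad (5.2)$$

At equilibrium there is no variation, so

$$\mathcal{P}_{Logout}(t + dt) = \mathcal{P}_{Logout}(t) = \mathcal{P}_{Logout} \qquad (5.3)$$
$$\mathcal{P}_{Login}(t + dt) = \mathcal{P}_{Login}(t) = \mathcal{P}_{Login} \qquad (5.4)$$

We obtain:

$$\mathcal{P}_{Logout} = \frac{\delta}{\lambda + \delta} \qquad (5.5)$$
$$\mathcal{P}_{Login} = \frac{\lambda}{\lambda + \delta} \qquad (5.6)$$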
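The equilibrium can be checked by applying the update rule of equations (5.1) and (5.2) repeatedly with a small time step until the probabilities stop changing. The Python sketch below does this; the parameter names `lam` (Login rate) and `delta` (Logout rate), the step size and the number of steps are illustrative choices.

```python
def iterate_login_chain(lam, delta, dt=1e-3, steps=200_000, p_login=0.0):
    """Apply equations (5.1)-(5.2) repeatedly, starting from a given state."""
    p_logout = 1.0 - p_login
    for _ in range(steps):
        new_logout = (1 - lam * dt) * p_logout + delta * dt * p_login
        new_login = (1 - delta * dt) * p_login + lam * dt * p_logout
        p_logout, p_login = new_logout, new_login
    return p_logout, p_login


def stationary(lam, delta):
    """Equilibrium probabilities (Logout, Login) in closed form."""
    return delta / (lam + delta), lam / (lam + delta)


print(iterate_login_chain(0.2, 0.5))
print(stationary(0.2, 0.5))
```

Whatever the starting state, the iteration settles on the same pair of values as the closed form, with the Login probability equal to the Login rate divided by the sum of both rates.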



5.2 Job submittal model 

During periods when users are logged in, they can submit jobs. We model the job submittal rate for one user as follows: during $dt$, when the user is logged in, he has a probability $\mu\,dt$ to submit a job. With $\delta = 0$ we have a delayed Poisson process; with $\mu = 0$ no jobs are submitted. The full model is shown in figure 8: it shows all the possible outcomes, with the corresponding probabilities, from one of the possible states to the next after a small period $dt$. Numbers inside circles are the number of jobs submitted since the start. Login states are at the bottom and Logout states are at the top. We have:

• $\mathcal{P}_n(t)$ is the probability to be in the state "User is not logged at time t and n jobs have been submitted between time 0 and t"

• $\mathcal{Q}_n(t)$ is the probability to be in the state "User is logged at time t and n jobs have been submitted between time 0 and t"

• $\mathcal{R}_n(t)$ is the probability to be in the state "n jobs have been submitted between time 0 and t". We have $\mathcal{R}_n = \mathcal{P}_n + \mathcal{Q}_n$.

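From the model, we obtain with the same method as before this recursive differential equation:

$$M = \begin{pmatrix} -\lambda & \delta \\ \lambda & -(\mu + \delta) \end{pmatrix} \qquad (5.7)$$
$$\frac{d}{dt}\begin{pmatrix} \mathcal{P}_0 \\ \mathcal{Q}_0 \end{pmatrix} = M \begin{pmatrix} \mathcal{P}_0 \\ \mathcal{Q}_0 \end{pmatrix} \qquad (5.8)$$
$$\frac{d}{dt}\begin{pmatrix} \mathcal{P}_n \\ \mathcal{Q}_n \end{pmatrix} = M \begin{pmatrix} \mathcal{P}_n \\ \mathcal{Q}_n \end{pmatrix} + \begin{pmatrix} 0 \\ \mu\,\mathcal{Q}_{n-1} \end{pmatrix} \qquad (5.9)$$

This results in the following recursive equation (in case the parameters are constant, $M$ is a constant):

$$\begin{pmatrix} \mathcal{P}_n \\ \mathcal{Q}_n \end{pmatrix}(t) = e^{Mt} \int_0^t e^{-Mx} \begin{pmatrix} 0 \\ \mu\,\mathcal{Q}_{n-1}(x) \end{pmatrix} dx \qquad (5.10)$$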



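We take a look at the probability of having no job arrival during an interval of time $t$, which is given by $\mathcal{P}_0$ and $\mathcal{Q}_0$. $\mathcal{R}_0$ is the probability that no jobs have been submitted between an arbitrary time 0 and $t$. So from the above model, we have:

$$\frac{d}{dt}\begin{pmatrix} \mathcal{P}_0 \\ \mathcal{Q}_0 \end{pmatrix} = \begin{pmatrix} -\lambda & \delta \\ \lambda & -(\mu + \delta) \end{pmatrix} \begin{pmatrix} \mathcal{P}_0 \\ \mathcal{Q}_0 \end{pmatrix} \qquad (5.11)$$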



[Diagram: a top row of Logout states and a bottom row of Login states, each indexed by the number of jobs submitted (0, 1, 2, ...). Each Logout state loops with probability $1 - \lambda\,dt$; each Login state loops with probability $1 - (\delta + \mu)\,dt$.]

Figure 8: Markov modelling of jobs submittal



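At an arbitrary time we could be in the state Login with probability $\lambda/(\lambda + \delta)$ and in the state Logout with probability $\delta/(\lambda + \delta)$. We have from the results above:

$$\begin{pmatrix} \mathcal{P}_0(0) \\ \mathcal{Q}_0(0) \end{pmatrix} = \begin{pmatrix} \mathcal{P}_{Logout} \\ \mathcal{P}_{Login} \end{pmatrix} \qquad (5.12)$$
$$\mathcal{R}_0 = \mathcal{P}_0 + \mathcal{Q}_0 \qquad (5.13)$$

Finally we obtain the result:

$$\mathcal{R}_0(t) = m\,\frac{e^{-m_1 t} - e^{-m_2 t}}{m_1 - m_2} + \frac{m_1 e^{-m_2 t} - m_2 e^{-m_1 t}}{m_1 - m_2} \qquad (5.14)$$
$$\mathcal{R}_0(t) = \frac{(m - m_2)\,e^{-m_1 t} + (m_1 - m)\,e^{-m_2 t}}{m_1 - m_2} \qquad (5.15)$$

where

$$m = \frac{\lambda\mu}{\lambda + \delta} = \mu\,\mathcal{P}_{Login} \qquad (5.16)$$
$$m_1 + m_2 = \lambda + \delta + \mu \qquad (5.17)$$
$$m_1 m_2 = \lambda\mu \qquad (5.18)$$

With $\lambda = 0$ or $\mu = 0$, we obtain that no jobs are submitted ($\mathcal{R}_0(t) = 1$). With $\delta = 0$, this is a Poisson process and $\mathcal{R}_0(t) = e^{-\mu t}$. Note that during a period $t$ there are on average $\mu\,\mathcal{P}_{Login}\,t$ jobs submitted. We also have, for a small period $t$,

$$\mathcal{R}_0(t) \approx 1 - \mu\,\mathcal{P}_{Login}\,t \qquad (5.19)$$

We also have

$$\mathcal{R}_0'(0) = -\mu\,\mathcal{P}_{Login} \qquad (5.20)$$
$$\mu\,\mathcal{P}_{Login} = \frac{\text{Number of jobs submitted}}{\text{Total duration}} \qquad (5.21)$$

$\mathcal{R}_0(t)$ can be estimated by splitting the arrival process into intervals of duration $t$ and estimating the ratio of intervals with no arrival. The error of this estimation is linear in $t$. Another issue is that the precision of the logs is not below one second.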
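The estimation procedure described above can be sketched in Python as follows. The estimator counts the fraction of length-t intervals that contain no arrival. The model function is a reconstruction: it solves the two-state system for the probability of no submission, starting from the equilibrium Login/Logout probabilities, with `lam`, `delta` and `mu` standing for the Login, Logout and submittal rates. All names are illustrative.

```python
import math


def estimate_r0(arrival_times, t, start, end):
    """Fraction of length-t intervals in [start, end) with no arrival."""
    n_intervals = int((end - start) // t)
    occupied = set()
    for a in arrival_times:
        if start <= a < start + n_intervals * t:
            occupied.add(int((a - start) // t))
    return 1.0 - len(occupied) / n_intervals


def r0_model(t, lam, delta, mu):
    """Probability of no submission during t under the Login/Logout model.

    Assumes the rates are constant and that m1 != m2 (this only fails in
    the degenerate case delta == 0 and lam == mu).
    """
    m = lam * mu / (lam + delta)          # mean submission rate
    s = lam + delta + mu                  # m1 + m2
    disc = math.sqrt(s * s - 4 * lam * mu)
    m1, m2 = (s + disc) / 2, (s - disc) / 2
    return ((m - m2) * math.exp(-m1 * t) + (m1 - m) * math.exp(-m2 * t)) / (m1 - m2)
```

Comparing `estimate_r0` on a trace with `r0_model` for several values of t is one possible way to fit the three rates; the fitting procedure itself is not shown here.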

5.3 Model characteristics 

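We have also these interesting properties:

$$\mathcal{R}_0'(0) = -\mu\,\mathcal{P}_{Login} \qquad (5.22)$$
$$\mathcal{R}_0''(0) = \mu^2\,\mathcal{P}_{Login} \qquad (5.23)$$
$$\frac{\mathcal{R}_0''(0)}{\mathcal{R}_0'(0)} = -\mu \qquad (5.24)$$
$$\frac{\mathcal{R}_0'(0)^2}{\mathcal{R}_0''(0)} = \mathcal{P}_{Login} \qquad (5.25)$$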
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6 SIMULATION AND VALIDATION 



The probability distribution of the duration between two job arrivals is called an interarrival process. The interarrival process is a common metric in queuing theory. We have $A(t) = \mathcal{P}_0(t) + \mathcal{Q}_0(t)$ with the initial condition that the user has just submitted a job, which implies that the user is logged in.

$$\mathcal{P}_0(0) = 0.0, \qquad \mathcal{Q}_0(0) = 1.0 \tag{5.26}$$

$$A(t) = \mu\,\frac{e^{-m_1 t} - e^{-m_2 t}}{m_1 - m_2} + \frac{m_1 e^{-m_2 t} - m_2 e^{-m_1 t}}{m_1 - m_2} \tag{5.27}$$

$$A(t) = p\,e^{-m_1 t} + (1 - p)\,e^{-m_2 t} \tag{5.28}$$

The rate $\mu$ lies between $m_1$ and $m_2$ because

$$(\mu - m_1)(\mu - m_2) = \mu^2 - (\lambda + \delta + \mu)\mu + \lambda\mu$$

$$(\mu - m_1)(\mu - m_2) = -\delta\mu < 0 \tag{5.29}$$

So $p \in [0; 1]$, and we have a hyper-exponential interarrival law of order 2 with parameters $p = (\mu - m_2)/(m_1 - m_2)$, $m_1$, $m_2$. This result is coherent with other experimental fitting results [23]. Moreover, any hyper-exponential law of order 2 may be modelled with the Markov chain described in figure 8 with parameters $\mu = p\,m_1 + (1 - p)\,m_2$, $\lambda = m_1 m_2/\mu$, $\delta = m_1 + m_2 - \mu - \lambda$.
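
The correspondence between the Markov parameters and the hyper-exponential parameters can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, the parameter values in any call are up to the reader, and the roots $m_1 \ge m_2$ are obtained from $m_1 + m_2 = \lambda + \delta + \mu$ and $m_1 m_2 = \lambda\mu$.

```python
import math
from typing import Tuple


def markov_to_hyperexp(lam: float, delta: float, mu: float) -> Tuple[float, float, float]:
    """Return (p, m1, m2) of the order-2 hyper-exponential interarrival law.

    Assumes m1 != m2 (not the degenerate case delta == 0 and lam == mu).
    """
    s = lam + delta + mu                      # m1 + m2
    disc = math.sqrt(s * s - 4.0 * lam * mu)  # m1 - m2
    m1 = (s + disc) / 2.0
    m2 = (s - disc) / 2.0
    p = (mu - m2) / (m1 - m2)
    return p, m1, m2


def hyperexp_to_markov(p: float, m1: float, m2: float) -> Tuple[float, float, float]:
    """Return (lam, delta, mu) of the Login/Logout chain for a given
    hyper-exponential law p*exp(-m1 t) + (1-p)*exp(-m2 t)."""
    mu = p * m1 + (1.0 - p) * m2
    lam = m1 * m2 / mu
    delta = m1 + m2 - mu - lam
    return lam, delta, mu
```

Applying `hyperexp_to_markov` to the output of `markov_to_hyperexp` should return the original $(\lambda, \delta, \mu)$ up to rounding, which is a quick way to validate a fit expressed in either parameterization.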

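Let us calculate the mean interarrival time. The probability of an interarrival time between $\theta$ and $\theta + d\theta$ is $A(\theta) - A(\theta + d\theta) = -A'(\theta)\,d\theta$. The mean is

$$\bar{A} = \int_0^\infty -\theta A'(\theta)\,d\theta = \int_0^\infty A(\theta)\,d\theta \tag{5.30}$$

$$\bar{A} = \frac{1}{\mu\,\mathcal{P}_{Login}} \tag{5.31}$$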

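Let us compute the variance of the interarrival distribution.

$$\mathrm{var} = \int_0^\infty -(\theta - \bar{A})^2 A'(\theta)\,d\theta \tag{5.32}$$

$$\mathrm{var} = 2\int_0^\infty \theta A(\theta)\,d\theta - \bar{A}^2 \tag{5.33}$$

$$\frac{\mathrm{var}}{\bar{A}^2} = CV^2 = 1 + 2\,\frac{\delta\mu}{(\lambda + \delta)^2} \tag{5.34}$$

$$CV^2 = 1 + 2\,\mathcal{P}_{Logout}^2\,\frac{\mu}{\delta} \tag{5.35}$$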
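
As a hedged numerical cross-check of the mean and of $CV^2$, the closed forms can be compared with the moments of the mixture $p\,e^{-m_1 t} + (1 - p)\,e^{-m_2 t}$ computed directly. The Python function names below are hypothetical and only serve this illustration.

```python
from typing import Tuple


def interarrival_mean(lam: float, delta: float, mu: float) -> float:
    """Mean interarrival time, 1 / (mu * P_Login)."""
    return (lam + delta) / (lam * mu)


def interarrival_cv2(lam: float, delta: float, mu: float) -> float:
    """Squared coefficient of variation, 1 + 2 * delta * mu / (lam + delta)**2."""
    return 1.0 + 2.0 * delta * mu / (lam + delta) ** 2


def hyperexp_moments(p: float, m1: float, m2: float) -> Tuple[float, float]:
    """Mean and squared CV of the mixture p * Exp(m1) + (1 - p) * Exp(m2),
    computed from its first two moments."""
    mean = p / m1 + (1.0 - p) / m2
    second = 2.0 * (p / m1 ** 2 + (1.0 - p) / m2 ** 2)
    return mean, second / mean ** 2 - 1.0
```

With $\delta = 0$ the squared CV reduces to 1, the value for a Poisson process, and it grows above 1 as soon as both $\delta$ and $\mu$ are positive.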



Another interesting property is the number of jobs submitted by this model during a Login period. Let $P_n$ be the probability of receiving $n$ jobs during a Login period. We have:

$$P_n = \int_0^\infty e^{-\mu t}\,\frac{(\mu t)^n}{n!}\,\delta e^{-\delta t}\,dt \tag{5.36}$$

$$P_n = \frac{\delta}{\mu + \delta}\left(\frac{\mu}{\mu + \delta}\right)^n \tag{5.37}$$

This is a geometric law. The mean number of jobs submitted per Login period is $\mu/\delta$.
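
The geometric law and its mean can be verified with a short Python sketch. The function names and the truncation bound `n_max` are assumptions introduced for this illustration.

```python
def jobs_per_login_pmf(n: int, delta: float, mu: float) -> float:
    """Probability of exactly n jobs during one Login period (geometric law)."""
    q = delta / (mu + delta)        # probability that logout occurs before the next job
    return q * (1.0 - q) ** n


def mean_jobs_per_login(delta: float, mu: float, n_max: int = 10_000) -> float:
    """Mean of the law, summed numerically up to n_max (truncation is an assumption)."""
    return sum(n * jobs_per_login_pmf(n, delta, mu) for n in range(n_max + 1))
```

For any positive `delta` and `mu`, the numerical mean should agree with `mu / delta` as long as `n_max` is large compared with that ratio.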

5.4 Group model 

Groups are composed of users, either regular users sending jobs at regular times or users with a Login/Logout-like behavior. Metrics such as the mean number of jobs sent per Login period, the mean submittal rate and the probability of Login could represent a user's behavior.



[Figure 9 (plot): job submittal rate of each user.]

Figure 9: Users' job submittal rates during their period of activity

Figure 9 shows the sorted distribution of the users' submittal rates ($\mu\,\mathcal{P}_{Login}$). Except for the highest values, it is close to a straight line in log space. This observation could be included in a group model.



6 Simulation and validation 

We have done a simulation in Scheme directly using the Markov model. We began by fitting users' behavior from the logs with our model. Like the frequency obtained from the logs, the model shows a majority of intervals with no job arrival, possibly followed by a relatively flat area and a fast decrease. Some fitting results are presented in figure 10. The norm used to fit real data is the maximum difference between the two cumulative distributions. We fitted the frequency data for each user.

[Figure 10 (plots): four panels, (a) Biomed user 1, (b) Biomed user 2, (c) Biomed user 3, (d) Biomed user 4. Each panel shows the distribution of jobs sent per 5 minutes for the Biomed user, the simulation, and a Poisson distribution fitted with the same mean. The fitted parameters are:]

Name             μ         δ            λ           Error
Biomed user 1    0.0837    0.02079      2.1e-4      4.929e-3
Biomed user 2    0.0620    0.01188      1.2e-4      3.534e-3
Biomed user 3    0.0832    0.02475      2.5e-4      1.1078e-2
Biomed user 4    0.0365    1.4285e-3    1.075e-4    8.78e-2

Figure 10: Biomed simulation results

[Figure 11 (plots): four panels, (a) LPC cluster Biomed user, (b) NASA Ames most active user (user 3), (c) DAS2 fs0 cluster most active user, (d) SDSC Blue Horizon most active user (user 35). Each panel shows the measured $\mathcal{R}_0$ probability and its hyper-exponential fitting (order 2).]

Figure 11: Hyper-exponential fitting of $\mathcal{R}_0$ for a Biomed LPC user and for the most active users at NASA Ames, DAS2 and SDSC Blue Horizon clusters.
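
The fitting norm mentioned above, the maximum difference between two cumulative distributions, can be sketched in Python as follows. This is an assumed implementation: the function name is hypothetical, and both inputs are taken to be histograms over the same bins (number of intervals with 0, 1, 2, ... jobs), which is our reading of the frequency data rather than a documented format.

```python
from itertools import accumulate
from typing import Sequence


def max_cdf_distance(freq_a: Sequence[float], freq_b: Sequence[float]) -> float:
    """Maximum absolute difference between the cumulative distributions of
    two histograms defined over the same bins."""
    n = max(len(freq_a), len(freq_b))
    # Pad the shorter histogram with empty bins so both cover the same range.
    a = list(freq_a) + [0.0] * (n - len(freq_a))
    b = list(freq_b) + [0.0] * (n - len(freq_b))
    total_a, total_b = sum(a), sum(b)
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each histogram must contain at least one observation")
    cdf_a = [x / total_a for x in accumulate(a)]
    cdf_b = [x / total_b for x in accumulate(b)]
    return max(abs(x - y) for x, y in zip(cdf_a, cdf_b))
```

A value of 0 means the two distributions coincide on every bin; the fit then consists of searching for the model parameters that minimize this distance against the histogram obtained from the logs.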

During a period of duration $t$, $\mu\,\mathcal{P}_{Login}\,t$ jobs are submitted on average. We evaluate $\mu\,\mathcal{P}_{Login}$, the average number of jobs submitted per second, and use that value when running a set of simulations in order to fit a known real user probability distribution. Two free parameters remain, so we vary $\mathcal{P}_{Login}$ between 0.0 and 1.0, and $\lambda$, whose inverse is the average time a user stays in the Logout state. Some of the results obtained are shown in figure 10.
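
The simulation step can be illustrated with a short Python sketch of the Login/Logout process. The original simulation was written in Scheme and is not reproduced here; the function names, the seeding scheme and the choice of drawing the initial state from the stationary probabilities are assumptions made for this example.

```python
import random
from typing import List


def simulate_arrivals(lam: float, delta: float, mu: float,
                      duration: float, seed: int = 0) -> List[float]:
    """Simulate job arrival times on [0, duration) for constant lam, delta, mu."""
    rng = random.Random(seed)
    # Initial state drawn from the stationary probability of Login.
    logged_in = rng.random() < lam / (lam + delta)
    t = 0.0
    arrivals: List[float] = []
    while True:
        if logged_in:
            # Two competing exponential events: job submittal (mu) or logout (delta).
            t += rng.expovariate(mu + delta)
            if t >= duration:
                break
            if rng.random() < mu / (mu + delta):
                arrivals.append(t)
            else:
                logged_in = False
        else:
            # Wait for the next login.
            t += rng.expovariate(lam)
            if t >= duration:
                break
            logged_in = True
    return arrivals


def jobs_per_interval(arrivals: List[float], duration: float,
                      width: float = 300.0) -> List[int]:
    """Count jobs in consecutive intervals of the given width (300 s = 5 minutes)."""
    n = int(duration // width)
    counts = [0] * n
    for a in arrivals:
        k = int(a // width)
        if k < n:
            counts[k] += 1
    return counts
```

The histogram of the values returned by `jobs_per_interval` is the simulated counterpart of the "jobs sent per 5 minutes" frequency measured in the logs. Over a long run the number of simulated jobs divided by the duration should approach $\mu\,\mathcal{P}_{Login}$.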

The $\mu$ parameter determines the length of the frequency curve. Without the Login behavior we would have obtained a classic Poisson curve of parameter $\mu$; $1/\mu$ is the mean interarrival time during a Login period. One way to evaluate $\mu$ would be to measure the job arrival rate during Login periods, but we lack that Login information.

$\delta$ and $\lambda$ are the Logout and Login parameters. What really matters is the ratio $\lambda/(\lambda + \delta)$, which is $\mathcal{P}_{Login}$: the ratio between the time the user is active on the cluster and the total time. $\delta$ and $\mathcal{P}_{Login}$ measure the deviation from a classic Poisson law. For instance, the mean number of jobs submitted per Login period is $\mu/\delta$ and the mean job submittal rate is $\mu\,\mathcal{P}_{Login}$. For the same $\mathcal{P}_{Login}$ we could have very different scenarios: one user could be active for long periods but rarely logged in, and another user could be active for short periods with frequent logins. $1/\delta$ is the mean Login time and $1/\lambda$ is the mean Logout time.

The $\mathcal{R}_0$ probability is essential for studying job arrival times. $1 - \mathcal{R}_0(t)$ is the probability that at least one job has been received between time 0 and $t$. It is easier to fit the $\mathcal{R}_0$ distribution for a user than the interarrival distribution because we have more points. Figure 11(a) shows a typical graph of $\mathcal{R}_0$ for a Biomed user. It shows, for instance, that for intervals of 10000 seconds this Biomed user has a probability of about 0.2 of submitting one or more jobs. We have fitted this probability with a hyper-exponential curve, that is, a sum of exponential curves. There was too much noise at large interval durations to fit that curve directly; in fact, errors on $\mathcal{R}_0$ are linear with $t$. So we smoothed the curve before fitting by averaging neighboring points. $\mathcal{R}_0$ for this user was fitted with a sum of two exponentials.
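
One simple way to average neighboring points before fitting is a centered moving average, sketched below in Python. The text does not specify the window actually used, so the function name and the `half_window` parameter are assumptions.

```python
from typing import List, Sequence


def smooth_by_neighbors(values: Sequence[float], half_window: int = 2) -> List[float]:
    """Replace each point by the mean of itself and up to half_window
    neighbors on each side (the window is truncated at the ends)."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

A larger `half_window` removes more noise at large interval durations but also flattens the fast initial decrease of the curve, so the window would have to be chosen with the fit quality in mind.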

It seems that, beyond the Login/Logout behavior, there is also a notion of user activity. For example, during the preparation of jobs or the analysis of results, a user does not use the Grid, and consequently the cluster, at all. In addition to the Login and Logout states, an Inactive state could be added to the model if needed.

6.1 Other workloads comparison 

User number 3 is the most active user from the NASA Ames iPSC/860 workload†. Figure 11(b) shows the $\mathcal{R}_0(t)$ probability for this user; it is clearly hyper-exponential of order 2, as for other users such as numbers 22 and 23. Other users, such as numbers 12 and 15, are more classical Poissonian users.

The DAS2 clusters (see note †) also used PBS and MAUI as their batch system and scheduler. The main difference with our setup is that they use Globus to co-allocate nodes on different clusters. We only have bag-of-tasks applications, which interact in a pipeline fashion through files stored on SEs. Their fs0 cluster, with 144 CPUs, is quite similar to ours. Figure 11(c) shows the $\mathcal{R}_0(t)$ probability for their most active user and the corresponding hyper-exponential fitting of order 2.

The SDSC Blue Horizon cluster (see note †) has a total of 144 nodes. The $\mathcal{R}_0(t)$ probability distribution of its most active user was fitted with a hyper-exponential of order 2 in figure 11(d).



7 Related works 

Our Grid environment is very particular and differs from common cluster environments: the parallelism involved requires no interaction between processes, and the degree of parallelism is one for all jobs.

To simulate the node usage completely, we need not only the job submittal process but also the job duration process. Our runtime model is similar to the Downey model |2l)| for runtime, which is composed of linear pieces in log space. There is a strong correlation between the running times of successive jobs, but it seems unlikely that a general model for duration can be built, because it depends highly on the algorithms and data used by users.

† The workload log from the NASA Ames iPSC/860 was graciously provided by Bill Nitzberg. The workload logs from DAS2 were graciously provided by Hui Li, David Groep and Lex Wolters. The workload log from the SDSC Blue Horizon was graciously provided by Travis Earheart and Nancy Wilkins-Diehr. All are available at the Parallel Workload Archives, http://www.cs.huji.ac.il/labs/parallel/workload/

Most other models use a Poisson distribution for the interarrival distribution. But evidence, such as a CV much higher than one, demonstrates that exponential distributions do not fit the real data well [2S1 ESI ■. The need for a detailed model was expressed in j2J]. With constant parameters, our model exhibits a hyper-exponential distribution for the interarrival time and justifies such a distribution choice. One strong benefit of our model is that it is general and could be used numerically with non-constant parameters, at the expense of a more difficult fitting.

8 Discussion 

What can be stated is that the maximum run times provided by users are essentially inaccurate; some authors do not even use this information for scheduling [2]. Maybe a better concept is the relative urgency of a job. For example, on a grid, software managers are the people responsible for installing software on cluster nodes by sending installation jobs. Software manager jobs may be regarded as more urgent than other job types. So sending jobs with an estimated runtime could be replaced by sending jobs with an urgency parameter. That urgency could be established in part as a site policy. Each site administrator could define user classes for different kinds of jobs and software, with different job priorities. For instance, a site hosted in a laboratory might wish to promote its own scientific domain over other domains, or some specific applications might need quality of service such as real-time interaction.

Another idea for scheduling is to include some sort of risk assessment in the scheduler's decision. This risk assessment may be based on blocking probabilities obtained either from the logs or from user behavior models. For example, it could be wise to prevent a group or a user from taking the whole cluster at a given time, and instead to keep a few percent of it open for short jobs or low CPU consuming jobs such as monitoring.



The Information System shows, for a site, the number of jobs currently running and waiting. But this is not really the relevant metric in an on-line environment. A better metric for a cluster is the input computing flow rate compared with the computing flow rate capacity. A cluster is able to process some amount of computation per unit of time, so a cluster contributes to the Grid with some computation flow rate (in GigaFLOPS or TeraFLOPS). As in classical queuing theory, if the input rate is higher than the capacity, the site is overloaded and global performance is low because jobs wait to be processed. The site receives more jobs than it is able to process in a given time, so the queues begin to grow and jobs have to wait longer and longer before being started, resulting in performance decay. Similarly, when the submitted computation rate is lower than the site capacity, the site is under-used. Job submittal also has to be fairly distributed according to site capacity. For example, a site that is twice as big as another site has to receive twice as many computing requests. But it is difficult to enforce this submittal scheme globally on the whole Grid. This is why a local site migration policy may be better than a central migration policy carried out by the RB.
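
The two ideas in this paragraph, distributing requests in proportion to site capacity and the growth of the backlog when the input rate exceeds the capacity, can be illustrated with a deliberately simple Python sketch. It is a deterministic fluid approximation and not a model proposed in this paper; the function names, the site names and the units are hypothetical.

```python
from typing import Dict


def split_by_capacity(total_rate: float, capacities: Dict[str, float]) -> Dict[str, float]:
    """Share a total computing flow rate between sites in proportion to their capacity."""
    total_capacity = sum(capacities.values())
    return {site: total_rate * c / total_capacity for site, c in capacities.items()}


def backlog_after(input_rate: float, capacity: float,
                  duration: float, initial: float = 0.0) -> float:
    """Amount of waiting computation after `duration` under constant rates
    (fluid approximation: the backlog grows only when input exceeds capacity)."""
    return max(0.0, initial + (input_rate - capacity) * duration)
```

For example, `split_by_capacity(300.0, {"A": 200.0, "B": 100.0})` gives site A twice the share of site B, and `backlog_after` stays at zero as long as the input rate does not exceed the capacity.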

To be more precise, there are two different kinds of cluster flow rate metrics: the local flow rate and the global flow rate. The local computing flow rate is the flow rate that one job sees when reaching the site. The global flow rate is the computing flow rate that a group of jobs sees when reaching the site. The global flow rate is also the main measurement for meta-scheduling between sites. These two metrics are different: for instance, we could have a site with a lot of slow machines (low local flow rate and high global flow rate) and another site with only a few supercomputers (high local flow rate and low global flow rate). But the most interesting metric for a single job is the local flow rate. This means that if each job individually asks to be processed at the site with the best local flow rate, this site will saturate and become globally slow.

As long as the total computation flow rate of all users and groups is less than the site global flow rate, or site capacity, there is no real fairness issue because there is no competition for the site resources: there is enough for all. The problem comes when the sum of all computation flow rates is greater than the site capacity. Firstly, this globally reduces the site performance; secondly, the scheduler must decide how to share these resources fairly. The Grid is an ideal tool that would allow the load to be balanced between sites by migrating jobs 0-. A site that shares its resources and is not saturated could relieve another, heavily loaded site. Some kind of local site flow control could maintain a bounded input rate even with fluctuating job submittal. For instance, fairness between groups and users could be maintained by decreasing the most demanding input rate and distributing it to other, less saturated sites.

Another benefit is that application computing flow rates may be partly expressed by users in their job requirements. The computing flow rate takes into account both the job sizes and their time limits. Fairness between users is then ensured if, whatever the flow values requested by each user, the part granted to each one penalizes no other user. The computing flow rate granted by a site to an application may depend on the application's degree of parallelism, which is for the moment the number of jobs. For instance, it may be more difficult to serve an application composed of only one job asking for a large computing flow rate than to serve an application asking for the same computing flow rate but composed of many jobs. Urgency is not totally measured by a computing flow rate. For example, a critical medical application that is a matter of life or death, arriving on a full site, has to be processed with priority. Allocating flow rates between users and groups has to be fair and has to take priority or urgency issues into account.

To use a site wisely, users have to bound their computational flow rate and negotiate it with site managers. A computing model has to be defined and published. These remarks are important in the case of on-line computing such as Grids, where the meta-scheduling strategy has to take a lot of parameters into account. General on-line load balancing and scheduling algorithms [28, 29] may be applied. Finding the best suited scheduling policy is still an open problem. A better understanding of job running time is necessary to have a full model.

The LCG middleware allows users to send their jobs to different nodes. This is done by way of a central element called the Resource Broker, which collects users' requests and distributes them to computing sites. Its main purpose is to match the available resources and balance the load of job submittal requests. Jobs are preferably placed near the data they need to use.

We would instead advise a peer-to-peer |^2*] view of the Grid over a centralized one. In this view, computing sites themselves work together with other computing sites to balance the average workload. Not relying on dependent services greatly improves the reliability and adaptability of the whole system. That kind of meta-scheduling has to be globally distributed, as stated by Dmitry Zotkin and Peter J. Keleher \V2\:

In a distributed system like Grid, the use of a central Grid scheduler† may result in a performance bottleneck and lead to a failure of the whole system. It is therefore appropriate to use a decentralized scheduler architecture and distributed algorithm.

gLite jnni is the next generation middleware for Grid computing. gLite will provide lightweight middleware for Grid computing. The gLite Grid services follow a Service Oriented Architecture, which will facilitate interoperability among Grid services. Architecture details of gLite could be viewed in . The architecture constituted by this set of services is not bound to specific implementations of the services, and although the services are expected to work together in a concerted way in order to achieve the goals of the end-user, they can be deployed and used independently, allowing their exploitation in different contexts. The gLite service decomposition has been largely influenced by the work performed in the LCG project. Service implementations need to be inter-operable in such a way that a client may talk to different independent implementations of the same service. This can be achieved by developing lightweight services that only require minimal support from their deployment environment and by defining standard and extensible communication protocols between Grid services.



9 Conclusion 

So far we have analyzed the workload of a Grid-enabled cluster and proposed an infinite Markov-based model that describes the process of job arrival. Then a numerical fitting has been done between the logs and the model. We find a behavior very similar to the logs; even bursts were observed during the simulation.

† like the Resource Broker used in LCG middleware